The bulk and the tail of minimal absent words in genome sequences.

Authors

  • Erik Aurell
  • Nicolas Innocenti
  • Hai-Jun Zhou
Abstract

Minimal absent words (MAWs) of a genomic sequence are words that are themselves absent from the sequence but whose subwords are all present in it. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms: the bulk are rather short, and only relatively few are long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the concept of the cores of a MAW, which are the sequences present in the genome that are closest to the given MAW. We show that in E. faecalis, E. coli and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs. We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
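To make the definition concrete, here is a minimal Python sketch that enumerates MAWs of a short sequence by brute force. It is not the authors' method: the function name `minimal_absent_words`, the `max_len` cutoff, and the toy input are assumptions made for illustration. It relies on the standard equivalence that a word a·u·b of length at least 2 is a MAW exactly when it is absent while both a·u and u·b are present. It stores every substring up to `max_len`, so it is only suitable for small examples, not whole genomes.

```python
from collections import Counter


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Naively enumerate minimal absent words (MAWs) of seq up to max_len.

    Illustrative sketch only: memory grows with len(seq) * max_len.
    """
    # Collect every word of length 1..max_len that occurs in seq.
    present = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }

    maws = set()

    # Length-1 MAWs: letters of the alphabet that never occur.
    for a in alphabet:
        if a not in present:
            maws.add(a)

    # Length >= 2: w = left + b, with left = a + u present,
    # u + b present, and w itself absent.
    for left in present:
        if len(left) >= max_len:
            continue  # left + b would exceed the length cutoff
        u = left[1:]
        for b in alphabet:
            suffix_present = (u + b) in present
            if suffix_present and (left + b) not in present:
                maws.add(left + b)
    return maws


if __name__ == "__main__":
    # Toy example over a two-letter alphabet (hypothetical input).
    maws = minimal_absent_words("AAC", max_len=3, alphabet="AC")
    print(sorted(maws))
    # Length distribution of the MAWs found.
    print(Counter(len(w) for w in maws))
```

Counting the results by length, as in the last line, gives the kind of length distribution the abstract refers to, here for a toy input only.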


Similar resources

Minimal Absent Words in Four Human Genome Assemblies

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, ...


Profile of Eight Prophage Sequences Present in the Genomes of Different Acinetobacter baumannii Strains

Abstract. Background and Objective: Prophage sequences are major contributors to interstrain variations within the same bacterial species. Acinetobacter baumannii is a gram-negative bacterium that causes a wide range of nosocomial infections, especially in intensive care unit inpatients. Prophage sequences constitute a considerable proporti...


O-44: Characterisation of Monotreme Caseins Reveals Lineage Specific Expansion of an Ancestral Casein Locus in Mammals

Background: One important reproductive characteristic of Mammals is the production of milk to nurse the neonate. In order to better understand the evolution of milk we have investigated gene expression in milk cells from monotremes which are the most ancient representative of the mammalian lineage. Materials and Methods: Using a milk cell cDNA sequencing approach we characterise milk protein se...


Phylogenetic relationships of Iranian Infectious Pancreatic Necrosis Virus (IPNV) based on deduced amino acid sequences of genome segment A and B cDNA

Infectious Pancreatic Necrosis Virus (IPNV) is the causal agent of a highly contagious disease that affects many species of fish and shellfish. This virus causes economically important diseases of farmed rainbow trout, Oncorhynchus mykiss, in Iran which is often associated with the transmission of pathogens from European resources. In this study, moribund rainbow trout fry were collected during...



Journal:
  • Physical biology

Volume 13, Issue 2

Pages: -

Publication date: 2016